{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "allWIe5kwPcS"
   },
   "source": [
    "# Time Series Datasets\n",
    "\n",
    "This notebook shows how to create a time series dataset from some csv file in order to then share it on the [🤗 hub](https://huggingface.co/docs/datasets/index). We will use the GluonTS library to read the csv into the appropriate format. We start by installing the libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "4XnNcdWbwrNo"
   },
   "outputs": [],
   "source": [
    "! pip install -q datasets gluonts orjson"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dI1yo_vHw5CV"
   },
   "source": [
    "GluonTS comes with a pandas DataFrame based dataset so our strategy will be to read the csv file, and process it as a `PandasDataset`. We will then iterate over it and convert it to a 🤗 dataset with the appropriate schema for time series. So lets get started!\n",
    "\n",
    "## `PandasDataset`\n",
    "\n",
    "Suppose we are given multiple (10) time series stacked on top of each other in a dataframe with an `item_id` column that distinguishes different series:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "e9FaT_VpwuI2",
    "outputId": "8a10c908-41e1-4ca7-b420-01c0810c5c4b"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "url = (\n",
    "    \"https://gist.githubusercontent.com/rsnirwan/a8b424085c9f44ef2598da74ce43e7a3\"\n",
    "    \"/raw/b6fdef21fe1f654787fa0493846c546b7f9c4df2/ts_long.csv\"\n",
    ")\n",
    "df = pd.read_csv(url, index_col=0, parse_dates=True)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "s4WCjvBqxi3B"
   },
   "source": [
    "After converting it into a `pd.Dataframe` we can then convert it into GluonTS's `PandasDataset`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "su1i8ZdDxf7c"
   },
   "outputs": [],
   "source": [
    "from gluonts.dataset.pandas import PandasDataset\n",
    "\n",
    "ds = PandasDataset.from_long_dataframe(df, target=\"target\", item_id=\"item_id\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cYnHkLdex_n3"
   },
   "source": [
    "\n",
    "## 🤗 Datasets\n",
    "\n",
    "From here we have to map the pandas dataset's `start` field into a time stamp instead of a `pd.Period`. We do this by defining the following class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "gv5Ytpwrx4FQ"
   },
   "outputs": [],
   "source": [
    "class ProcessStartField():\n",
    "    ts_id = 0\n",
    "    \n",
    "    def __call__(self, data):\n",
    "        data[\"start\"] = data[\"start\"].to_timestamp()\n",
    "        data[\"feat_static_cat\"] = [self.ts_id]\n",
    "        self.ts_id += 1\n",
    "        \n",
    "        return data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "DrhbT1QIyPMA"
   },
   "outputs": [],
   "source": [
    "from gluonts.itertools import Map\n",
    "\n",
    "process_start = ProcessStartField()\n",
    "\n",
    "list_ds = list(Map(process_start, ds))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Ug2kLNUPyeyJ"
   },
   "source": [
    "Next we need to define our schema features and create our dataset from this list via the `from_list` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "r1rQUtvGycaF"
   },
   "outputs": [],
   "source": [
    "from datasets import Dataset, Features, Value, Sequence\n",
    "\n",
    "features  = Features(\n",
    "    {    \n",
    "        \"start\": Value(\"timestamp[s]\"),\n",
    "        \"target\": Sequence(Value(\"float32\")),\n",
    "        \"feat_static_cat\": Sequence(Value(\"uint64\")),\n",
    "        # \"feat_static_real\":  Sequence(Value(\"float32\")),\n",
    "        # \"feat_dynamic_real\": Sequence(Sequence(Value(\"uint64\"))),\n",
    "        # \"feat_dynamic_cat\": Sequence(Sequence(Value(\"uint64\"))),\n",
    "        \"item_id\": Value(\"string\"),\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "RrpP2oAxywC8"
   },
   "outputs": [],
   "source": [
    "dataset = Dataset.from_list(list_ds, features=features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GrbGlDtYzyIf"
   },
   "source": [
    "We can thus use this strategy to [share](https://huggingface.co/docs/datasets/share) the dataset to the hub."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
